FINAL  TECHNICAL  PROGRESS  REPORT


Abstract 

The  aim  of  the  research  summarized  in  this  final  report  was  to  investigate  a  class 
of  orthogonal  shared-memory  architectures  and  interconnection  networks,  and  to  obtain 
generalized  methods  for  implementing  algorithm-based  fault  tolerance  (ABFT)  on
multiprocessor  architectures.

We  proposed  a  theory  based  on  orthogonal  graphs  to  represent  many  well-known 
interconnection  networks  such  as  the  binary  m-cube,  spanning-bus  meshes,  multistage 
interconnection  networks,  etc.  A  previously  proposed  multiprocessor  architecture  called 
the  Orthogonal  Multiprocessor  (OMP)  is  also  a  special  case  of  this  method.  The
simplicity  of  the  graph  construction  rules  permits  us  to  characterize  and  understand  the
differences  and  similarities  among  networks  such  as  the  SW-banyan  and  the  baseline
network,  among  others.  This  opens  the  way  for  discovering  new  structures  by  studying
different  possible  combinations  of  the  parameters  which  define  orthogonal  graphs. 

In  the  area  of  ABFT,  we  proposed  general  synthesis-for-fault-tolerance  methods 
for  multiprocessor  architectures  based  on  dependence  graphs.  Integrating  fault
tolerance  during  synthesis  allows  us  to  reduce  the  overheads  considerably.  At  the  same
time  it  allows  us  to  attack  problems  which  could  not  be  treated  in  any  general  way 
before.  Most  of  the  existing  ABFT  techniques  can  be  obtained  by  our  method.  No 
such  method  has  been  presented  before.  We  next  proposed  methods  for  designing
ABFT  systems  with  an  optimal  number  of  checks  using  randomized  algorithms,  where
no  known  deterministic  method  could  provide  optimality.  We  successfully  applied  the
randomized  method  to  the  problems  of  s-error  detectability,  s-error  diagnosability  and
easy  diagnosis.  We  then  proposed  a  design  method  for  t-fault  detectable/locatable 
ABFT  systems  and  presented  bounds  on  the  different  parameters  used  in  such  a
system.  We  considered  the  application  of  ABFT  techniques  to  massively  parallel  systems.
We  presented  a  low-overhead,  high  fault-coverage  ABFT  scheme  for  FFT  networks. 
We  finally  presented  fault-detecting/locating  schedules  for  computation  DAG’s
implemented  on  multiprocessor  systems.

A  Summary  of  Overall  Progress

This  section  summarizes  the  progress  made  with  the  help  of  Grant  AFOSR-90-0144,
awarded  by  the  Air  Force  Office  of  Scientific  Research  jointly  to  the  principal
investigators,  Niraj  K.  Jha  and  Isaac  Scherson.  We  give  below  a  brief  account  of  the 
research  accomplished  during  the  two  years  for  which  funding  was  given. 

Eleven  conference  and  six  journal  papers  (all  in  IEEE  Transactions)  have  been 
written  acknowledging  this  grant.  Among  the  journal  papers,  one  has  been  published, 
one  accepted  for  publication,  one  has  been  revised,  one  is  under  review,  and  two  are 
about  to  be  submitted.  One  of  the  conference  papers  won  the  Best
Paper  Award  at  the  International  Conference  on  Parallel  Processing,  1990.  Some  other
papers  have  been  presented/accepted  at  prestigious  conferences  such  as  the  IEEE
International  Symposium  on  Fault  Tolerant  Computing.  The  acceptance  rate  at  many  of
these  conferences  is  as  low  as  20%.  Copies  of  the  papers  are  also  being  sent  with  this
report.  The  research  work  can  be  broken  up  into  two  areas:  (1)  Construction  and 
fault  diagnosis  of  interconnection  networks  (IN’s),  and  (2)  Algorithm-Based  Fault 
Tolerance  (ABFT)  techniques  for  parallel  processing  architectures. 

A.  Construction  and  Fault  Diagnosis  of  INs 

In  a  Best-Paper-Award  winning  paper  [1]  and  its  journal  version  [2],  we  showed 
how  an  orthogonal  graph-based  representation  of  a  class  of  interconnection  networks 
can  be  obtained.  The  proposed  theory  is  applicable  to  many  well-known  interconnec¬ 
tion  networks  such  as  the  binary  m-cube  and  spanning-bus  meshes.  Orthogonal  graphs 
were  also  used  for  the  construction  of  multistage  interconnection  networks.  We  pro¬ 
vided  connectivity  and  placement  rules  and  showed  that  these  yielded  a  large  number 
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of  well-known  networks.  We  had  previously  proposed  a  multiprocessor  architecture 
called  the  Orthogonal  Multiprocessor  (OMP)  (this  architecmre  was  also  independently 
proposed  by  a  group  at  USC).  In  OMP,  access  is  defined  by  either  rows  or  columns, 
hence  the  name.  It  was  successfully  applied  to  many  vector  processing  problems. 
OMP  is  a  particular  case  of  the  generalized  theory  based  on  orthogonal  graphs  that 
was  developed  in  [1].  Mapping  of  orthogonal  graphs  as  switching  networks  leads  to
the  generation  of  multistage  interconnection  networks  (MIN’s).  The  simplicity  of  the
graph  construction  rules  permits  the  description  of  well-known  networks  as  well  as  the
understanding  of  their  differences  and  similarities.  These  networks  include  the
SW-banyan  and  the  baseline  network,  among  others.  The  importance  of  this  approach  lies  in
the  fact  that  hypercube-like  machine  equivalences  can  be  easily  stated.  Routing  in
orthogonal  graphs  is  shown  to  reduce  to  the  node  covering  problem  in  bipartite  graphs. 
We  believe  that  by  studying  different  possible  combinations  of  the  parameters  which 
define  orthogonal  graphs,  new  structures  may  be  discovered.  Of  particular  importance 
is  its  applicability  to  the  study  of  reliable  systems.  A  systematic  methodology  might 
be  found  which  yields  the  desired  degree  of  fault  tolerance  based  on  formal  parameters 
that  define  the  multiprocessing  system. 
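
The row/column access rule of the OMP can be illustrated with a minimal sketch. It assumes n processors and an n x n array of memory modules, with processor i touching row i in row mode and column i in column mode; this layout and the function names are assumptions made for illustration and are not taken from [1].

```python
def modules_accessed(proc, mode, n):
    """Memory modules (row, col) that processor `proc` touches in `mode`."""
    if mode == "row":
        return {(proc, c) for c in range(n)}
    if mode == "col":
        return {(r, proc) for r in range(n)}
    raise ValueError("mode must be 'row' or 'col'")


def is_conflict_free(mode, n):
    """True if, in a single mode, no module is claimed by two processors
    and every one of the n*n modules is reachable."""
    seen = set()
    for p in range(n):
        mods = modules_accessed(p, mode, n)
        if seen & mods:
            return False
        seen |= mods
    return len(seen) == n * n
```

Under these assumptions each mode taken alone is conflict-free, while a row-mode processor and a column-mode processor always share exactly one module, which is why all processors are taken to operate in the same mode at a time.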

In  [3]  we  extended  the  work  of  [1,  2].  We  showed  that  a  very
simple  relaxation  of  the  omega  graph  (a  particular  case  of  orthogonal  graphs)
construction  rule  leads  to  an  interesting  class  of  shared-memory  systems.  Because  this  rule  can
be  interpreted  as  the  composition  of  omega  graphs  of  different  dimensions,  the
resulting  graphs  are  called  multidimensional  orthogonal  graphs.  A  design  methodology  was
proposed  for  the  implementation  of  the  interconnection  network  required  by  a
functional  shared-memory  system.  The  methodology  results  in  the  definition  of  the  type  of
multistage  network  which  connects  a  number  of  processors,  on  one  side,  to  a  larger
number  of  memory  modules  on  the  output  side.  The  network  is  consistent  with  the 
constraint  that  all  processors  gain  conflict-free  access  to  the  bank  of  memory  modules 
for  different  access  modes.  We  feel  that  more  work  on  mapping  scalable  algorithms 
onto  these  shared  memory  systems  may  well  lead  to  the  automatic  generation  of 
special-purpose  feasible  systems. 

In  [4]  we  showed  how  to  embed  binary  trees  in  orthogonal  graphs.  This  is 
important  since  binary  trees  are  common  in  computational  algorithms.  From  the 
embedding  procedure,  an  isomorphic  binary  tree  is  generated  with  a  node  labeling 
order  similar  to  the  traversal  order  of  a  breadth-first  spanning  tree.  In  the  case  of 
orthogonal  graphs  describing  multidimensional  access  memory  configurations,  the 
embedded  isomorphic  tree  can  be  traversed  with  only  two  link  modes. 

In  [5]  we  looked  at  the  problem  of  routing  between  any  two  nodes  of  an
orthogonal  graph.  This  problem  can  be  reduced  to  a  node  covering  problem  in  the  bipartite
coverage  graph.  A  minimum  cover  results  in  the  shortest  path.  In  general,  the
minimum  node  cover  problem  is  NP-complete.  However,  because  of  the  regular
pattern  of  edges  for  the  bipartite  graph  for  the  orthogonal  graphs,  a  minimum  cover  can
be  found  in  time  polynomial  in  the  number  of  bit-nodes  of  the  bipartite  graph.  So  the 
complexity  is  only  quadratic  in  the  logarithm  of  the  number  of  nodes  in  the  original 
orthogonal  graph. 
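
For intuition only, shortest-path routing in the binary m-cube, one of the networks covered by the theory, can be sketched as follows. This is ordinary dimension-order routing on node labels and is not the node-covering procedure of [5]; the function name is hypothetical.

```python
def mcube_route(src, dst, m):
    """Shortest path from src to dst in a binary m-cube.

    Nodes are integers in [0, 2**m); two nodes are adjacent when their
    labels differ in exactly one bit. The path corrects the differing
    bits one at a time, from the lowest dimension to the highest.
    """
    if not (0 <= src < 2 ** m and 0 <= dst < 2 ** m):
        raise ValueError("node label out of range")
    path = [src]
    cur = src
    for bit in range(m):
        if ((cur ^ dst) >> bit) & 1:
            cur ^= 1 << bit  # traverse the link along this dimension
            path.append(cur)
    return path
```

The number of hops equals the Hamming distance between the two labels, so the work per route is linear in m, i.e. logarithmic in the number of nodes.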

A  key  step  in  providing  network  reliability  and  fault  tolerance  in  MIN’s  is  the
detection  and  location  of  faulty  elements  in  them.  In  previous  methods,  the  test  results  are
distributed  among  all  processors.  This  requires  a  host  processor  to  collect  all  test
results  and  diagnose  the  system  for  faulty  elements  based  on  the  syndrome.  However,
such  a  centralized  scheme  requires  the  host  processor  and  the  bus  to  be  a  hardcore  (i.e. 
their  failure  makes  the  scheme  useless).  Furthermore,  in  a  system  with  no  host,  one 
processor  has  to  be  dedicated  just  for  fault  diagnosis.  In  [6]  a  distributed  on-line  fault 
diagnosis  method  for  a  multi-path  MIN,  the  Augmented  Shuffle  Exchange  Network 
(ASEN),  is  presented.  Such  networks  depend  on  an  effective  diagnosis  scheme  to
perform  fault  tolerant  routing.  The  test  patterns  used  in  this  method  can  be  applied
asynchronously,  which  means  that  a  processor  can  apply  its  tests  whenever  it  is  not 
busy.  Thus,  the  system  performance  is  not  degraded  by  this  method. 

B.  ABFT  Techniques  for  Multiprocessor  Architectures 

Algorithm-based  fault  tolerance  (ABFT)  is  a  low-overhead  fault  tolerance  scheme 
for  high-speed  parallel  processing  systems.  To  minimize  the  effect  of  erroneous  data 
in  such  systems,  ABFT  schemes  employ  concurrent  error  detection.  In  other  words, 
erroneous  data  are  detected  concurrently  with  normal  operation.  ABFT  systems  are 
also  useful  for  concurrent  fault  location,  once  errors  produced  by  a  fault  have  been 
detected.  The  data  produced  by  an  ABFT  system  are  encoded  at  the  system  level.  The 
encoded  data  are  then  used  to  verify  that  the  system  is  fault-free  using  a  set  of 
“checks”. 
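
As a concrete instance of system-level encoding and checks, the well-known checksum-matrix scheme of Huang and Abraham for matrix multiplication can be sketched as below. It is shown only as a familiar example of ABFT, not as one of the schemes developed under this grant; the function names are hypothetical and integer data are assumed so that checks can compare exactly.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def with_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append a column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def failed_checks(c):
    """Return (bad_rows, bad_cols) for a full-checksum matrix c, whose
    last row and last column hold the column and row checksums."""
    n_rows, n_cols = len(c) - 1, len(c[0]) - 1
    bad_rows = [i for i in range(n_rows)
                if sum(c[i][:n_cols]) != c[i][n_cols]]
    bad_cols = [j for j in range(n_cols)
                if sum(c[i][j] for i in range(n_rows)) != c[n_rows][j]]
    return bad_rows, bad_cols
```

Multiplying a column-checksum matrix by a row-checksum matrix yields a full-checksum product, so a single erroneous element shows up as exactly one failing row check and one failing column check, which also locates it.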

While  many  ABFT  schemes  exist  for  obtaining  fault  tolerant  implementations  of 
particular  algorithms  on  particular  architectures,  not  much  has  been  done  on  obtaining 
a  generalized  method.  In  [7,  8]  we  proposed  a  general  synthesis-for-fault-tolerance 
approach  to  attack  this  problem.  In  this  approach,  rather  than  adding  fault  tolerance 
features  after  the  system  has  been  synthesized,  we  add  these  features  during  the
synthesis  process  itself.  This  allows  us  to  reduce  the  hardware  and  time  overhead
required  for  fault  tolerance  considerably.  This  method  is  based  on  dependence  graphs
which  are  extensively  used  in  the  design  of  VLSI  array  processors.  Most  of  the
existing  ABFT  schemes  presented  by  previous  researchers  can  be  obtained  by  our  method.
Thus  our  method  unifies  existing  results.  At  the  same  time,  it  provides  a  framework 
for  attacking  problems  which  were  not  even  considered  in  much  detail  before.  For 
example,  our  method  can  be  used  to  obtain  ABFT  schemes  for  non-linear  problems, 
whereas  most  of  the  previous  methods  have  limited  themselves  to  linear  problems 
only. 

An  ABFT  system  is  said  to  be  t-fault  (error)  detectable/diagnosable  if
simultaneous  faults  in  t  or  fewer  processors  (simultaneous  errors  in  t  or  fewer  data  elements)
can  be  detected/located  at  run-time.  Such  systems  are  typically  modeled  by  a  tripartite
graph  consisting  of  processor,  data  and  check  nodes.  In  [9]  we  provide  bounds  on  the 
various  parameters  needed  to  design  such  systems.  We  also  give  a  design  method 
which  uses  a  composition  technique  for  deriving  complex  ABFT  systems  from  simpler 
unit  systems.  For  the  design  of  t-fault  detectable  systems,  we  allow  sharing  of  data,
whereas  the  previous  method  does  not.  Another  advantage  of  our  method  is  that  the
checks  are  uniform,  i.e.  all  the  checks  have  the  same  parameters.  This  can
considerably  ease  the  design  of  the  checks  as  well  as  make  the  hardware  and  time  overhead
incurred  by  the  different  checks  the  same,  which  is  obviously  desirable. 

Designing  checks  to  locate  or  detect  errors  in  the  data  is  a  central  problem
in  the  area  of  ABFT.  We  presented  solutions  to  this
problem  in  [10].  Our  checks  are  assumed  to  be  of  the  simplest  kind,  i.e.  a  check  can
operate  without  any  restriction  on  any  subset  of  the  data  and  can  reliably  detect  up  to
one  error  in  this  set  of  data.  In  [10]  we  showed  how  to  design  the  data-check  (DC) 
relationship  optimally,  i.e.  using  the  least  possible  number  of  checks.  For  the  first 
time,  we  gave  a  general  procedure  for  designing  checks  to  locate  s  errors,  given  any 
value  for  s.  We  also  considered  the  problem  of  designing  checks  to  detect  s  errors  in
the  data.  This  is  the  first  optimal  construction  for  this  problem.  The  procedures  for
designing  the  checks  are  simple  and  novel.  We  showed  how  one  can  modify  these
constructions  to  produce  uniform  checks,  i.e.  checks  which  are  identical  and  check  the 
same  number  of  elements.  We  also  gave,  for  the  first  time,  a  method  for  constructing 
the  DC  relationship  so  that  a  linear-time  diagnosis  algorithm  can  be  used  for
diagnosing  (locating)  faults  in  the  system.  If  a  system  is  not  designed  for  easy  diagnosability,
present  diagnosis  algorithms  require  exponential  time.  Finally,  we  presented  methods
for  constructing  the  DC  graph  for  systems  which  are  simultaneously  s-error
diagnosable  and  t-error  detectable,  t  >  s.  Thus,  in  this  paper  we  have  presented  the  only  or
the  best-known  solutions  to  three  major  problems  and  two  minor  problems  in  the  area 
of  ABFT.  These  results  are  also  shown  to  be  optimal  or  near-optimal.  They  can  be 
used  along  with  any  general  technique  for  designing  fault  tolerant  PDC  graphs. 

In  [11]  we  considered  the  applicability  of  ABFT  to  massively  parallel  scientific 
computation.  Existing  ABFT  schemes  can  provide  effective  fault  tolerance  at  a  low 
cost  for  computation  on  matrices  of  moderate  size;  however,  the  methods  do  not  scale 
well  to  floating-point  operations  on  large  systems.  This  paper  proposes  the  use  of
partitioned  linear  codes  to  provide  scalability.  Matrix  algorithms  employing  this  scheme
are  presented  and  compared  to  current  ABFT  schemes,  with  respect  to  numerical
stability  and  hardware/time  overhead.  The  partitioned  scheme  provides  scalable  linear
codes  with  improved  numerical  properties  with  only  a  small  increase  in  hardware  and
running  time  overhead.  Many  ABFT  schemes  have  been  proposed  in  the  past  for  fast 
Fourier  transform  (FFT)  networks.  In  [12]  we  propose  a  new  ABFT  scheme  for  FFT 
networks.  We  show  that  our  new  approach  maintains  the  high  throughput  of  the
previous  schemes,  yet  needs  lower  hardware  overhead  and  achieves  higher  fault  coverage
than  previous  schemes  by  Jou  et  al.  and  Tao  et  al.

In  [13]  we  investigate  issues  concerning  the  construction  of  minimal-length
fault-detecting  and  fault-locating  schedules  for  computation  DAG’s  implemented  on
multiprocessor  systems.  The  basic  idea  used  here  is  to  duplicate  computations  by  using
the  idle  processors  and  perform  comparisons  on  these  duplicated  computations  to 
detect  or  locate  the  faults.  Earlier  work  in  this  area  focussed  entirely  on  constructing 
fault-secure  schedules.  We  develop  conditions  for  a  schedule  to  be  fault-detecting  or 
fault-locating  and  further  use  these  conditions  to  propose  schemes  for  construction  of 
the  schedules.  Lower  bounds  on  the  length  of  the  schedules  are  calculated  and  it  is
shown  that  our  schedules  meet  the  lower  bounds  in  most  cases.  A  method  for  actual
fault  diagnosis  from  the  results  of  the  fault-locating  schedules  is  also  proposed. 
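
The duplicate-and-compare idea can be illustrated with a toy sketch. It ignores DAG dependencies and schedule length, which are the real subject of [13]. It assumes at most one faulty processor and models a fault as a corrupted result; the fault model and all names are hypothetical.

```python
def run_with_duplication(tasks, assignment, faulty):
    """Run each task on two processors and compare the results.

    tasks: dict name -> zero-argument callable returning an int.
    assignment: dict name -> (primary, duplicate) processor ids.
    faulty: set of processor ids that corrupt whatever they compute.
    Returns the names of tasks whose two copies disagree.
    """
    def execute(task, proc):
        result = task()
        # Hypothetical fault model: a faulty processor returns a wrong value.
        return result + 1 if proc in faulty else result

    mismatches = []
    for name, (primary, duplicate) in assignment.items():
        if primary == duplicate:
            raise ValueError("copies must run on different processors")
        if execute(tasks[name], primary) != execute(tasks[name], duplicate):
            mismatches.append(name)
    return mismatches


def suspects(mismatches, assignment):
    """Processors common to every failed comparison (single-fault case)."""
    sets = [set(assignment[name]) for name in mismatches]
    return set.intersection(*sets) if sets else set()
```

With tasks `a` on processors (0, 1) and `b` on (1, 2), a fault in processor 0 is detected but only narrowed down to the pair {0, 1}, whereas a fault in processor 1 fails both comparisons and is located exactly. This mirrors the distinction between fault-detecting and fault-locating schedules.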